Abstract

We studied the effects of time-correlation of subsequent patterns on the convergence of
on-line learning by a feedforward neural network with the backpropagation algorithm. Using
chaotic time series as sequences of correlated patterns, we found that an unexpected scaling
of the converging time with the learning parameter emerges when time-correlated patterns
accelerate the learning process.
It has been reported |T]-§[ that time-correlation of input patterns often strongly influences
the convergence of on-line learning. As a concrete example, learning of a chaotic map was
shown to converge faster when patterns appeared in the deterministic order of chaos than when
patterns appeared randomly with the same probability density as the chaotic time series.
This showed that on-line learning is sensitive to the order of subsequent patterns. However,
the influence of the time-correlation on the convergence of on-line learning has not yet been
analyzed.

If we express the input and output as vectors, supervised learning is the task of acquiring
the mapping relation $X_p\,(\in \mathbb{R}^n) \mapsto Y_p\,(\in \mathbb{R}^n)$ $(p \in \mathbb{N}$ or $\mathbb{R})$,
where the set $\{X_p, Y_p\}$ is called a 'pattern' and $p$ is a pattern index. When the pattern
index $p$ is continuous, the number of patterns, $L$, is infinite. In gradient descent learning
algorithms, the neural network system is updated as follows:

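$$\omega_{n+1} = \omega_n + \delta\omega_n, \qquad \delta\omega_n = -\epsilon\,\frac{\partial E_n}{\partial \omega_n}, \qquad (1)$$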

where $\omega_n$ is the weight vector at discrete time $n$, $E_n$ is a generalized error that
depends on the learning procedure, and $\epsilon$ is the learning parameter.

Among several learning rules, the 'backpropagation' algorithm [4], which is a natural
extension of the steepest descent method to neural networks, is often used for its ability to
realize the desired mapping relation in a network. The algorithm was originally formulated as
an on-line learning procedure. The on-line procedure of backpropagation can be divided into
two kinds. The first is 'probabilistic on-line learning' (POL), which uses the "local error"
$E_{p_n}$ in Eq. (1): $E_{p_n}(X_{p_n}, \omega) = (\sigma(\omega) - Y_{p_n})^2/2$, where the pattern index
$p_n$ at discrete time $n$ is drawn with pattern probability $P_p$ satisfying
$\sum_{p=1}^{L} P_p = 1$, and $\sigma$ is the output of the network.

On the other hand, time-correlated input patterns are often fed into the network, as in
on-line learning of time series. In such cases, the patterns may be presented in the
deterministic order of appearance $p_{n+1} = f(p_n)$, where $f$ is a map that produces the
time series of pattern indices. We call this second on-line learning procedure 'deterministic
on-line learning' (DOL). Although in DOL we mainly analyze the case in which the target
function coincides with the map that generates the sequence of pattern indices, more generally
the dynamics that generates the sequence of patterns may differ from the target function.

In contrast to on-line learning, we also discuss 'global learning' (GL), which is a
modified algorithm of POL. The algorithm uses the "global error" $E_{gl}(\omega)$ in Eq. (1):
$E_{gl}(\omega) = \int E_p(X_p, \omega)\,\rho(p)\,dp$ $(p \in \mathbb{R})$, which is the error averaged
over patterns, where $\rho(p)$ is the probability density of the pattern with index $p$. This
algorithm is often easier to analyze, because the error does not depend on any particular pattern.
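As a rough illustration of the global error, the integral can be approximated numerically. The sketch below is not taken from the paper: it assumes a scalar input, a uniform density on [0, 1], and a midpoint quadrature rule, and it uses the tent map (the target introduced later in the text) as the target function. The names `global_error`, `model` and `n_cells` are hypothetical.

```python
def tent(x, r=1.0):
    """Tent map used later in the text as the target function."""
    return r * (1.0 - 2.0 * abs(x - 0.5))


def global_error(model, target, n_cells=1000):
    """Midpoint-rule estimate of the pattern-averaged squared error.

    Assumes a uniform pattern density on [0, 1]; `model` and `target`
    are callables mapping a scalar input to a scalar output.
    """
    h = 1.0 / n_cells
    total = 0.0
    for k in range(n_cells):
        x = (k + 0.5) * h                      # midpoint of the k-th cell
        total += 0.5 * (model(x) - target(x)) ** 2
    return total * h


# A network stuck at the constant output 0 has global error close to 1/6
# for the tent map with r = 1, since the integral of f(x)^2 / 2 is 1/6.
constant_zero_error = global_error(lambda x: 0.0, tent)
```

Because the estimate averages over the whole input range, it does not depend on which single pattern happens to be presented, which is the property the text relies on.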

Although on-line learning does not follow the exact gradient descent of the global error as
global learning (GL) does, the complete randomness of subsequent patterns in POL makes an
analytical approach possible in terms of a master equation, which is approximated by a
Fokker-Planck equation in the limit of small learning parameters |§-|§. Exactly solvable
models are also discussed in the literature PJlOfl.

Recently Wiegerinck and Heskes [11] showed theoretically that time-correlation between
subsequent patterns of on-line learning contributes to the diffusion term of a weight vector
in the Fokker-Planck equation approximated from the equation equivalent to Eq. (1), and
suggested that the result may help to understand the accelerated on-line learning with
time-correlated patterns found in [|l],|J.

In this paper, we study how time-correlation of subsequent patterns affects the convergence
of learning, through comparative studies of two on-line learning procedures: a)
probabilistic on-line learning (POL) and b) deterministic on-line learning (DOL). We use
the tent map as the target mapping relation in most cases, because this map makes the
comparative study easy; the results are found to be similar for other maps. The tent map [12]
is written as:



$$x_{n+1} = f(x_n) = r\,(1 - 2\,|x_n - 1/2|). \qquad (2)$$

In DOL, we use the sequence of patterns produced by the tent map itself. When $r = 1$,
the time series produced by the map has a white Fourier spectrum and a constant invariant
density on $[0, 1]$, the same as uniformly distributed random numbers on $[0, 1]$; the
deterministic nature of the chaotic correlation is therefore expected to appear clearly in
comparison with probabilistic randomness.
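The two ways of presenting inputs can be sketched as follows. This is an illustrative sketch only: the function names, the seed, the starting point and the sequence length are assumptions, not values from the paper. It uses r = 0.9995 (the value quoted in the figure captions) rather than r = 1, because with r = 1 exactly a binary floating-point orbit of the tent map typically collapses onto the fixed point 0 within a few dozen iterations.

```python
import random


def tent(x, r=0.9995):
    """Tent map of Eq. (2)."""
    return r * (1.0 - 2.0 * abs(x - 0.5))


def dol_inputs(x0, steps, r=0.9995):
    """Deterministic order (DOL): x_{n+1} = f(x_n)."""
    xs = [x0]
    for _ in range(steps - 1):
        xs.append(tent(xs[-1], r))
    return xs


def pol_inputs(steps, rng):
    """Probabilistic order (POL): independent uniform draws on [0, 1)."""
    return [rng.random() for _ in range(steps)]


def histogram(xs, bins=10):
    """Fraction of samples in each of `bins` equal cells of [0, 1]."""
    counts = [0] * bins
    for x in xs:
        counts[min(int(x * bins), bins - 1)] += 1
    return [c / len(xs) for c in counts]


rng = random.Random(0)               # fixed seed, for reproducibility only
dol = dol_inputs(0.1234, 20000)
pol = pol_inputs(20000, rng)
dol_hist = histogram(dol)            # both histograms should be roughly flat
pol_hist = histogram(pol)
```

Both histograms come out roughly flat, so the two sequences differ mainly in the order in which the values appear, which is the comparison the paper is set up to make.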

Let us now consider a conventional feedforward neural network with an input terminal, an
output terminal, and $N - 2$ hidden layers of $M$ neurons each. The output of the $i$-th neuron
of the $m$-th layer of the network is as follows:

$$y_i^2 = \tanh\left(\omega_i^1 x - c_i^1\right),$$

$$y_i^3 = \tanh\Big(\sum_{j=1}^{M} \omega_{ij}^2\, y_j^2 - c_i^2\Big),$$

$$\vdots \qquad (3)$$

where $\omega_i^1, \omega_{ij}^2, \ldots, \omega_i^{N-1}$ are the synaptic weights connecting the input terminal
to the second-layer neurons, the second layer to the third, ..., and the $(N-1)$-th layer to
the output $\sigma$, respectively, and $c_i^{m-1}$ is the bias term of the $i$-th neuron of the $m$-th
layer. In this paper, we restrict ourselves for simplicity to the case $N = 4$ and $M = 3$. The
hidden layers $(y^2, y^3, \ldots, y^{N-1})$ have full inter-layer connections. The local error $E$ is written:

$$E(x_n, \omega) = (\sigma(\omega) - f(x_n))^2/2, \qquad (4)$$

where $f(x)$ is the functional relationship of the tent map. The global error is also used in
on-line learning to evaluate how learning progresses, because it does not depend on the
particular input pattern $x_n$.
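A minimal sketch of such a network and of one on-line update is given below, for the case N = 4, M = 3. Several details are assumptions made only for illustration: the output unit is taken to be a tanh unit with its own bias, biases enter with a minus sign, the parameters are stored in a flat list of 22 numbers, and the gradient is obtained by central differences as a simple stand-in for backpropagation. None of the function names come from the paper.

```python
import math

M = 3          # neurons per hidden layer (the text fixes N = 4, M = 3)
N_PARAMS = 22  # 3+3 (input -> layer 2), 9+3 (layer 2 -> 3), 3+1 (layer 3 -> output)


def tent(x, r=0.9995):
    """Target function: the tent map of Eq. (2)."""
    return r * (1.0 - 2.0 * abs(x - 0.5))


def forward(w, x):
    """Network output for scalar input x and flat parameter list w."""
    w1, c1 = w[0:3], w[3:6]
    w2 = [w[6 + M * i: 6 + M * (i + 1)] for i in range(M)]
    c2 = w[15:18]
    w3, c3 = w[18:21], w[21]
    y2 = [math.tanh(w1[i] * x - c1[i]) for i in range(M)]
    y3 = [math.tanh(sum(w2[i][j] * y2[j] for j in range(M)) - c2[i])
          for i in range(M)]
    # Assumption: the output unit is also a tanh unit with a bias.
    return math.tanh(sum(w3[j] * y3[j] for j in range(M)) - c3)


def local_error(w, x, r=0.9995):
    """Local error of Eq. (4) for a single input pattern x."""
    return 0.5 * (forward(w, x) - tent(x, r)) ** 2


def numerical_gradient(w, x, r=0.9995, h=1e-6):
    """Central-difference gradient of the local error (not backpropagation)."""
    grad = []
    for k in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[k] += h
        w_minus[k] -= h
        diff = local_error(w_plus, x, r) - local_error(w_minus, x, r)
        grad.append(diff / (2.0 * h))
    return grad


def online_step(w, x, eps, r=0.9995):
    """One on-line update: w <- w - eps * dE/dw for the current pattern."""
    grad = numerical_gradient(w, x, r)
    return [wk - eps * gk for wk, gk in zip(w, grad)]


# Zero weights are used here only to make the check easy to follow;
# the figure captions describe small random initial weights instead.
w0 = [0.0] * N_PARAMS
error_before = local_error(w0, 0.25, r=1.0)
w1 = online_step(w0, 0.25, eps=0.05, r=1.0)
error_after = local_error(w1, 0.25, r=1.0)
```

Feeding `online_step` inputs in chaotic order or in random order would give DOL-like or POL-like runs of this toy model, respectively.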

It is known that learning curves decrease suddenly between plateaus for many target
functions and models. In learning the tent map function, there usually exists a critical
time at which the global error $E_{gl}$ decreases sharply and the map learned by the network
shifts abruptly from a constant to a tent @. Thus, one can easily define the converging time
$t_{cr}$ as the time when the global error crosses the geometric mean of $E_{gl}$ on the first
plateau and that on the second plateau (see Fig. 1). Typical learning curves of the tent map
function are shown in Fig. 1. Generically, the three converging times of the tent map
learning are found to satisfy the inequality $t^{c}_{cr} < t^{r}_{cr} < t^{g}_{cr}$, where $t^{c}_{cr}$, $t^{r}_{cr}$ and
$t^{g}_{cr}$ are the converging times of DOL, POL and GL, respectively. Notice that the invariant
density $\rho(x)$ of GL and that of POL are always made the same as that of the chaotic input
(DOL) for comparative purposes. The order of the three converging times is consistent with
previous reports PHI. As one expects from the dynamical equations for the weight vectors, the
three converging times coincide for $\epsilon \to 0$.
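The definition of the converging time can be sketched in code. The helper below is hypothetical: it assumes the learning curve is available as a list of global-error values, one per time step, and that the two plateau levels have already been read off by the caller (how the plateaus are identified is not specified in the text).

```python
import math


def converging_time(errors, e_first_plateau, e_second_plateau):
    """First time step at which the global error falls to or below the
    geometric mean of the two plateau levels; None if it never does."""
    threshold = math.sqrt(e_first_plateau * e_second_plateau)
    for t, e in enumerate(errors):
        if e <= threshold:
            return t
    return None


# Synthetic learning curve: a plateau near 0.1, a sharp drop, a plateau near 0.001.
curve = [0.1, 0.1, 0.1, 0.05, 0.002, 0.001, 0.001]
t_cr = converging_time(curve, 0.1, 0.001)
```

Using the geometric rather than the arithmetic mean places the threshold halfway between the plateaus on a logarithmic error axis, which suits curves that drop by orders of magnitude.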

How is the effect of the deterministic randomness of subsequent patterns, which follow the
chaotic time series, related to that of probabilistic ones? We first concentrate on this
problem and discuss the difference in converging time between DOL ((b) in Fig. 1) and
POL ((c) in Fig. 1). Recent studies show that chaotic perturbation has anomalous effects on
complex systems such as the Hopfield model [13] and general multi-stable systems [13], even if
the simple statistical quantities (mean, variance, probability density and Fourier spectrum)
of the chaos coincide with those of random noise. The effects are known to be related to the
unstable fixed points of chaos. A chaotic force has transiently strong time-correlation when
the input pattern $x$ is in the neighborhood of an unstable fixed point; these are $x^* = 0$ and
$x^* = 2/3$ in the tent map with $r = 1$. The nearer the input $x$ is injected to one of the
unstable fixed points, the longer $x$ stays in its neighborhood. Therefore the network in
DOL sees biased (or special) patterns for a while, during which the input $x$ stays for several
steps in the vicinity of the unstable fixed point. In this period, the system keeps moving in
the direction that reduces the special local error $E(x^*)$, i.e. the system is moved largely
without the constraint of the global error, owing to the unstable fixed points of the
chaotic map. This phenomenon is easily verified by numerical simulation, as in Fig. 2. It
should be noticed that the direction of motion of the weight vector in this period is not
necessarily the one that reduces the global error $E_{gl}$. On the other hand, when the input
$x$ stays away from the unstable fixed points, the sequence of inputs is almost as random as a
probabilistic one; therefore a large change of the weight vector in a finite number of time
steps is unlikely to occur, and the system is expected to move mostly along a gradient descent
path of the global error.
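The statement that inputs injected closer to an unstable fixed point stay longer in its neighborhood can be checked directly. In the sketch below, the neighborhood radius of 0.05 and the function name are arbitrary choices made for illustration. Near x* = 2/3 the tent map with r = 1 doubles the distance to the fixed point at every step, so the residence time grows roughly like the base-2 logarithm of (radius / initial distance).

```python
def tent(x, r=1.0):
    """Tent map of Eq. (2); r = 1 is safe here because the runs are short."""
    return r * (1.0 - 2.0 * abs(x - 0.5))


def residence_time(x0, x_star=2.0 / 3.0, radius=0.05, r=1.0, max_steps=1000):
    """Number of consecutive steps the orbit spends within `radius` of x_star."""
    x, steps = x0, 0
    while abs(x - x_star) < radius and steps < max_steps:
        x = tent(x, r)
        steps += 1
    return steps


far_start = residence_time(2.0 / 3.0 + 1e-3)   # injected 1e-3 away from x*
near_start = residence_time(2.0 / 3.0 + 1e-6)  # injected 1e-6 away from x*
```

During such a residence the network is shown nearly the same pattern many times in a row, which is the transient time-correlation described above.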

The difference in time-correlation of the input patterns strongly affects the convergence of
learning even when all the simple statistical quantities coincide between the tent map chaos
and uniform random numbers, as mentioned before. Therefore, the effect of this chaotic
time-correlation on the convergence of learning needs to be clarified. In DOL, the correlation
range of the input can be varied by changing the iteration number $N$ (not to be confused with
the number of layers) in the selection rule for the sequence of patterns,
$x_{n+1} = f^N(x_n)$, while fixing the target function $f$ as $x_{n+1} = f(x_n)$. In the
strong chaos limit, $N \to +\infty$, the time-correlation of the subsequent inputs $x$ disappears:
the sequence of input patterns is expected to be as random as a probabilistic one. Fig. 3 shows
that the time-correlation of weakly chaotic input ($N = 1$) indeed accelerates time series
learning. The effect of the time-correlation of subsequent patterns on the acceleration decays
quickly: $t^c_{cr}$ for $N = 2$ is nearly equal to that for $N = 100$. This is consistent with the
exponential decay of deterministic correlation with increasing $N$ TM. The saturated value of
$t^c_{cr}$ is equivalent to the learning time for random input (POL).
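The selection rule based on the N-fold iterate can be illustrated with a small sketch. The function names and the sample point 0.3 are assumptions for illustration. Since the tent map has slope of magnitude 2r everywhere except at the kink, the N-fold iterate stretches small differences by (2r)^N, which is why the Lyapunov exponent of the input sequence grows linearly with N.

```python
import math


def tent(x, r=1.0):
    """Tent map of Eq. (2)."""
    return r * (1.0 - 2.0 * abs(x - 0.5))


def tent_iterated(x, n_iter, r=1.0):
    """N-fold iterate f^N(x), used as the input selection rule in DOL."""
    for _ in range(n_iter):
        x = tent(x, r)
    return x


def local_stretching(x, n_iter, r=1.0, h=1e-7):
    """|d f^N / dx| by central difference (valid away from the kinks of f^N)."""
    up = tent_iterated(x + h, n_iter, r)
    down = tent_iterated(x - h, n_iter, r)
    return abs(up - down) / (2.0 * h)


slope = local_stretching(0.3, 3)   # close to 2**3 for r = 1
lyapunov = math.log(slope)         # close to 3 * log(2)
```

A stretching factor that grows exponentially with N means that the next input quickly becomes unpredictable from the current one at any finite resolution, in line with the rapid saturation of the converging time reported above.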

The effect of the time-correlation of the input on the function learning decreases with
decreasing $\epsilon$, and it vanishes completely in the adiabatic limit $\epsilon \to 0$, where the change
of the weight vector per unit time is so small that the evolution of the system is averaged
over pattern indices within a short time [15]. Therefore the dynamics of POL and that of DOL
(and also GL) should coincide with each other in this limit. Equation (1) indicates that the
continuous time $T$, as used in the Fokker-Planck description [11], should be proportional to
$\epsilon$. Therefore the converging time $t_{cr}$ in the discrete model, both for POL and for DOL,
should scale as $\epsilon^{-1}$, and $\epsilon t_{cr}$ should be independent of $\epsilon$ in the $\epsilon \to 0$ limit.
However, it has not been understood how a finite learning parameter $\epsilon$ affects the
accelerated learning, that is, how $\epsilon t_{cr}$ behaves as a function of $\epsilon$.
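The scaling argument can be written out as a short sketch. It assumes that Eq. (1) has the usual gradient form, with an increment equal to the negative error gradient multiplied by $\epsilon$, and it introduces the rescaled time $T = \epsilon n$; the symbol $T_{cr}$ is used here only for illustration and denotes the converging time measured in that rescaled time.

```latex
\begin{align*}
\omega_{n+1} - \omega_n &= -\epsilon\,\frac{\partial E_n}{\partial \omega},
  \qquad T \equiv \epsilon n, \\
\frac{\omega(T+\epsilon) - \omega(T)}{\epsilon} &= -\frac{\partial E_n}{\partial \omega}
  \;\xrightarrow{\;\epsilon \to 0\;}\;
  \frac{d\omega}{dT} = -\frac{\partial E_{gl}}{\partial \omega}, \\
t_{cr} &= \frac{T_{cr}}{\epsilon}
  \quad\Longrightarrow\quad
  \epsilon\, t_{cr} = T_{cr} \quad \text{(independent of } \epsilon\text{)}.
\end{align*}
```

In the limit, many patterns are presented while the weights barely move, so the local error gradient is effectively replaced by its average over the pattern density, that is, by the gradient of the global error. Any dependence of $\epsilon t_{cr}$ on $\epsilon$ is therefore a finite-$\epsilon$ effect.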

One finds from the simulation results (Fig. 4) that the two normalized converging times
$\epsilon t_{cr}$ approach the same value in the small learning parameter limit ($\epsilon \to 0$). The
approach of the normalized converging time to a finite value in this limit shows that there
is no local minimum in the learning process: if there were any local minima, the normalized
converging time would diverge to infinity as $\epsilon \to 0$ [16]. In POL, the normalized
converging time increases monotonically with increasing learning parameter $\epsilon$. In DOL,
however, the normalized converging time $\epsilon t^c_{cr}$ first decreases with increasing $\epsilon$ and,
beyond a certain learning parameter $\epsilon_{opt}$, increases monotonically.

As is known for general relaxation methods, a finite stepping parameter $\epsilon$ is harmful,
because the possibility of overshooting in phase space increases as $\epsilon$ increases. Therefore,
the normalized converging time is expected to increase monotonically with increasing learning
parameter [10] as a result of overshooting in a learning process without local minima. The
simulation showed that this is the case for POL but not necessarily for DOL (Fig. 4): a
decrease of $\epsilon t^r_{cr}$ with increasing $\epsilon$ was not observed in the simulations. On the other
hand, a decrease of $\epsilon t^c_{cr}$ was often found for chaotic patterns, not only in the learning
of the tent map but also in that of the logistic map with several parameters.

A reduction of the converging time with increasing learning parameter is possible when the
system has to escape from local minima to reach the solution of learning |T(|. But the present
system has no local minima. One might find it strange that the normalized converging time
decreases as the learning parameter increases in a process without local minima. There should
be an alternative mechanism that overcomes the harm of overshooting in the region
$0 < \epsilon < \epsilon_{opt}$ in DOL.

We found that the puzzle may be solved by noticing that a learning process generically has
several gradient descent paths to the solution. If the chaotic correlation of subsequent
patterns works effectively to find a shorter path to the solution through its diffusive
motion in weight space, the observed phenomena are understandable. This possibility is
strengthened by the fact that the system under chaotic patterns (DOL) should be moved far
away from the exact gradient descent direction of the global error owing to the unstable
fixed points, which would help the system cross the potential barrier between the gradient
descent paths.

The same order of diffusive motion against the gradient descent direction of the global error
as found in DOL (see Fig. 1) would be possible, in principle, even in POL with a larger
learning parameter $\epsilon$. However, an increase of $\epsilon$ simultaneously strengthens the harm of
overshooting: the harm of overshooting may cancel the merit of the diffusive motion in POL.
In DOL, the harm of overshooting overcomes the merit of the diffusive motion when $\epsilon$
exceeds $\epsilon_{opt}$, where the normalized converging time begins to increase.

Finally, we mention an automatic reduction mechanism of the fluctuation of the system,
which is characteristic of on-line learning and may weaken the harm of overshooting with
a finite learning parameter. As discussed in the literature [0,0, in "perfectly trainable
networks" JT7H, in which $E_{gl}(\omega) = 0$ is attainable, the fluctuation in weight vector space
(equivalently, the diffusion rate in the Fokker-Planck representation |f[~lj|) becomes zero when
the system reaches the error-free ($E_{gl} = 0$) state: the error-free state behaves as a "sink"
of probability flow ||. The reduction of the fluctuation can also occur even if the network is
not "perfectly trainable": the system should be stabilized when the residual error is small
enough ]n|.

We showed in this paper that the accelerated on-line learning with chaotic patterns
is attributed to an unexpected scaling of the converging time with the learning parameter
$\epsilon$: the converging time $t_{cr}$ decreases much faster than $t_{cr} \propto \epsilon^{-1}$ with increasing
$\epsilon$, even without local minima. The results may indicate a beneficial aspect of finite
learning parameters in on-line learning with time-correlated patterns, because in any case one
is forced to use finite learning parameters in realistic learning processes. Studies of the
optimal time-correlation of general patterns and/or the optimal learning parameter for the
network, together with a proof of the acceleration mechanism, are under way.
FIGURES



FIG. 1. Typical learning curves of the tent map function for three learning methods: a) global
learning, b) deterministic on-line learning, c) probabilistic on-line learning. The invariant
densities of a) and c) are made the same as that of b). The initial conditions of the weight
vectors are the same; $\epsilon = 0.05$ and $r = 0.95$.

FIG. 2. (a) Typical temporal evolution of the input $x$ found in DOL, where $t < t^c_{cr}$.
(b) Corresponding time evolution of the averaged velocity of the weight vector $\omega$ in a
finite time interval $T$, where $\delta\omega = |\delta\boldsymbol{\omega}| = |\boldsymbol{\omega}_{n+T/2} - \boldsymbol{\omega}_{n-T/2}|$. The initial value of
$\omega$ is drawn from uniformly distributed random numbers on $[-0.05, 0.05]$; $\epsilon = 0.05$ and
$r = 0.9995$.

FIG. 3. Converging time $t^c_{cr}$ versus several deterministic time-correlations of the input
patterns $x$. The Lyapunov exponent $\lambda$ of the sequence of inputs is $\lambda = N \log 2$. In the
(strong chaos) limit $N \to \infty$, the system is almost equivalent to POL with uniformly random
input on $[0, 1]$. Ensemble averages over 100 initial conditions are shown. The initial value of
$\omega$ is drawn from uniformly distributed random numbers on $[-0.05, 0.05]$; $\epsilon = 0.05$ and
$r = 0.9995$. It is found that $t^c_{cr}(N \gg 1) \approx t^r_{cr}$ (POL).

FIG. 4. Dependence of the normalized converging time $\epsilon t_{cr}$ on the learning parameter
$\epsilon$ for the tent map learning (solid line for DOL, dotted line for POL). Ensemble averages
over 100 initial conditions are shown. The initial value of $\omega$ is drawn from uniformly
distributed random numbers on $[-0.1, 0.1]$, and $r = 0.9995$.